SSOR preconditioning of fermion matrix inversions which is parallelized using a locally-lexicographic lattice
sub-division has been shown to be very efficient for standard Wilson fermions. We demonstrate here the power of 
this method for the Sheikholeslami-Wohlert improved fermion action and for a renormalization group improved 
action incorporating couplings of the lattice fermion fields up to the diagonal in the unit hypercube. 



1. Introduction 

Recently, the symmetric successive over-relaxed (SSOR) preconditioner turned out to be parallelizable by means of the locally-lexicographic (ll) ordering technique [1]. In this way, SSOR preconditioning has been made applicable to the acceleration of standard Wilson fermion inversions on high-performance massively parallel systems, where it outperforms odd-even (o/e) preconditioning.

It appears intriguing to extend the range of ll-SSOR preconditioners so as to accelerate the inversion of improved fermionic actions, which have become very popular in recent years.

In Symanzik's on-shell improvement program [2], counter terms are added to both the lattice action and the composite operators in order to reduce the $O(a)$ artifacts which spoil results in the Wilson fermion formulation. In the approach of Sheikholeslami and Wohlert (SWA) [3], the Wilson action is modified by adding a diagonal term, the so-called clover term, with a new free parameter $c_{SW}$.

Perfect lattice actions are situated on renormalized trajectories in parameter space that intersect the critical surface (at infinite correlation length) in a fixed point of a renormalization group transformation. Perfect actions are free of any cut-off effects, but in practice they can only be constructed approximately. A promising approach for asymptotically free theories is the use of classically perfect actions [?] as an approximation to perfect ones. Moreover, practical applications require a truncation of the couplings to short distances (truncated perfect actions, TPA). In the present investigation, we consider a variant of the hypercube fermion (HF) approximation.

The generic form of both SWA and TPA is given by

$$ M = D + A + B + C + E + \dots \qquad (1) $$

$D$ stands for $12 \times 12$ diagonal sub-blocks; $A$, $B$, ... are nearest-neighbor, next-to-nearest-neighbor, ... hopping terms. In the following, we will show that the ll-SSOR scheme applies not only to the couplings in $A$ but also to the internal spin and colour d.o.f. of $D$ (SWA) as well as to all the couplings of $B$, $C$, $E$, ... of TPA.

2. SWA and HF Actions 

SWA is composed of A (Wilson hopping term) 
and D (SW diagonal): 

$$ \dots\, c_{SW} \,\dots\; + (1 + \gamma_\mu)\, U_{-\mu}(x)\, \delta_{x,\,y+\hat\mu} . $$

$\kappa$ is the Wilson hopping parameter, and $c_{SW}$ is the coupling of the SW clover operator. This parameter is tuned to optimize $O(a)$ cancellations. The clover term $F_{\mu\nu}(x)$ consists of $12 \times 12$ diagonal blocks. Its explicit structure in Dirac space is given in Ref. [?].

As a prototype TPA we have investigated the perfect free action constructed in Ref. [?] by means of block variable renormalization group transformations for free fermions. Its couplings decay exponentially fast, and therefore they can be truncated to short range [?]. We limit ourselves to couplings up to 4-space diagonals in the unit hypercube (hypercube fermion, HF). The gauge links are introduced in an obvious way: we connect all the coupled sites by all possible $d!$ shortest lattice paths in a $d$-space diagonal, by multiplying the compact gauge fields on the path links. For a given link, we average over all $d!$ paths from hyper-links $U^{(d)}_{\mu_1+\mu_2+\dots+\mu_d}(x)$ built up recursively:

$$ U^{(d)}_{\mu_1+\mu_2+\dots+\mu_d}(x) = \frac{1}{d} \left[ U_{\mu_1}(x)\, U^{(d-1)}_{\mu_2+\dots+\mu_d}(x+\hat\mu_1) + \dots + U_{\mu_d}(x)\, U^{(d-1)}_{\mu_1+\dots+\mu_{d-1}}(x+\hat\mu_d) \right] \qquad (3) $$
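The recursive construction can be illustrated with a short Python sketch. It is not the implementation used for the benchmarks: the helper names (`shift`, `random_links`, `hyperlink`, `hyperlink_brute`), the dictionary layout of the gauge field, the restriction to positive directions, and the 1/d normalization of each recursion step are assumptions made for illustration. The brute-force routine averages the ordered link products over all $d!$ shortest paths explicitly, so that the two results can be compared.

```python
import itertools
import numpy as np

def shift(x, mu, shape):
    """Site reached from x by one step in positive direction mu (periodic lattice)."""
    y = list(x)
    y[mu] = (y[mu] + 1) % shape[mu]
    return tuple(y)

def random_links(shape, rng, n=3):
    """Random unitary n x n link matrices, keyed by (direction, site)."""
    links = {}
    for x in itertools.product(*(range(s) for s in shape)):
        for mu in range(len(shape)):
            a = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
            links[mu, x], _ = np.linalg.qr(a)  # Q factor is unitary
    return links

def hyperlink(links, x, dirs, shape):
    """Hyper-link along the diagonal spanned by `dirs`, built recursively:
    average over the d possible first steps, each followed by a (d-1)-hyper-link."""
    if len(dirs) == 1:
        return links[dirs[0], x]
    total = sum(
        links[mu, x] @ hyperlink(links, shift(x, mu, shape),
                                 dirs[:k] + dirs[k + 1:], shape)
        for k, mu in enumerate(dirs)
    )
    return total / len(dirs)

def hyperlink_brute(links, x, dirs, shape):
    """Explicit average of the ordered link products over all d! shortest paths."""
    n = links[dirs[0], x].shape[0]
    paths = list(itertools.permutations(dirs))
    total = np.zeros((n, n), dtype=complex)
    for path in paths:
        prod, site = np.eye(n, dtype=complex), x
        for mu in path:
            prod = prod @ links[mu, site]
            site = shift(site, mu, shape)
        total += prod
    return total / len(paths)

rng = np.random.default_rng(0)
shape = (2, 2, 2, 2)
links = random_links(shape, rng)
x0 = (0, 1, 0, 1)
U4 = hyperlink(links, x0, (0, 1, 2, 3), shape)
print(np.allclose(U4, hyperlink_brute(links, x0, (0, 1, 2, 3), shape)))
```

The recursion needs only $d$ matrix products per hyper-link once the $(d-1)$-hyper-links are available, whereas the explicit path sum grows like $d!$.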
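Defining effective $\Gamma$'s by

$$ \Gamma_{\pm\mu_1 \pm\mu_2 \dots \pm\mu_d} = \lambda_d + \kappa_d \left( \pm\gamma_{\mu_1} \pm \gamma_{\mu_2} \pm \dots \pm \gamma_{\mu_d} \right), \qquad (4) $$

with the HF hopping parameters $\kappa_i$ and $\lambda_i$, we arrive at: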



$$ D_{HF}(x,y) = \lambda_0\, \delta_{x,y} $$

$$ A_{HF}(x,y) = \sum_\mu \left[ \Gamma_{+\mu}\, U_{\mu}(x)\, \delta_{x,\,y-\hat\mu} + \dots \right] $$

It is straightforward to write down the expressions for $C_{HF}$ and $E_{HF}$. Altogether 80 hyper-links contribute while 40 have to be stored.
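The count of 80 hyper-links can be checked by enumerating all non-zero offsets inside the unit hypercube in four dimensions. The following Python sketch does this and groups the offsets by the number $d$ of non-zero components. The function name `hypercube_offsets` is hypothetical, and the pairing of each offset with its negative, used here to arrive at 40, is an assumption about why only half of the hyper-links need to be stored (the reversed hyper-link being presumably the Hermitian conjugate of a hyper-link starting at the neighboring site).

```python
import itertools
from collections import Counter

def hypercube_offsets(dim=4):
    """All non-zero offsets with components in {-1, 0, +1}."""
    return [d for d in itertools.product((-1, 0, 1), repeat=dim) if any(d)]

offsets = hypercube_offsets()

# Number of offsets per d-space diagonal (d = number of non-zero components).
by_diagonal = Counter(sum(abs(c) for c in d) for d in offsets)

# Keep one representative of each pair {d, -d}.
stored = [d for d in offsets if d > tuple(-c for c in d)]

print(len(offsets), dict(sorted(by_diagonal.items())), len(stored))
# prints: 80 {1: 8, 2: 24, 3: 32, 4: 16} 40
```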



3. Block SSOR Preconditioning 

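The preconditioned system is modified by two matrices $V_1$ and $V_2$,

$$ V_1^{-1} M V_2^{-1}\, \tilde\psi = \tilde\phi, \qquad \tilde\phi = V_1^{-1} \phi, \quad \tilde\psi = V_2\, \psi. \qquad (6) $$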



Let $M = D - L - U$ be the decomposition of $M$ into its block diagonal part $D$, its (block) lower triangular part $-L$ and its (block) upper triangular part $-U$. Block SSOR preconditioning is defined through the choice

$$ V_1 = \left( \frac{1}{\omega} D - L \right) \left( \frac{1}{\omega} D \right)^{-1}, \qquad V_2 = \frac{1}{\omega} D - U . \qquad (7) $$



The Eisenstat trick [1] reduces the costs by a factor of 2. It is based on the identity:

$$ V_1^{-1} M V_2^{-1} = (1 - \omega L D^{-1})^{-1} \left[ 1 + (\omega - 2)(1 - \omega U D^{-1})^{-1} \right] + (1 - \omega U D^{-1})^{-1} . $$

The preconditioned matrix-vector product, $z = V_1^{-1} M V_2^{-1} x$, is given by:

solve $(1 - \omega U D^{-1})\, y = x$
compute $w = x + (\omega - 2)\, y$
solve $(1 - \omega L D^{-1})\, v = w$
compute $z = v + y$

The "solve" is just a simple forward (backward) substitution process due to the triangular structure:

for i = 1 to N
    $v_i = w_i + \sum_{j=1}^{i-1} L_{ij} s_j$
    $s_i = \omega D_{ii}^{-1} v_i$
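The four steps of the preconditioned matrix-vector product can be checked numerically on a small dense matrix. The Python sketch below is an illustration only: it assumes scalar ($1 \times 1$) diagonal blocks, uses dense NumPy arrays instead of a lattice data layout, and the function names `split`, `eisenstat_product` and `direct_product` are hypothetical. The reference routine assumes the standard SSOR choice $V_1 = (D/\omega - L)(D/\omega)^{-1}$, $V_2 = D/\omega - U$ and forms $V_1^{-1} M V_2^{-1} x$ explicitly.

```python
import numpy as np

def split(M):
    """Return (d, L, U) with M = diag(d) - L - U (scalar diagonal blocks)."""
    return np.diag(M).copy(), -np.tril(M, -1), -np.triu(M, 1)

def eisenstat_product(M, x, omega):
    """z = V1^{-1} M V2^{-1} x using two triangular substitutions."""
    d, L, U = split(M)
    n = len(x)
    x = np.asarray(x, dtype=np.result_type(M, x, float))
    # Backward substitution: (1 - omega U D^{-1}) y = x, with s = omega D^{-1} y.
    y, s = np.zeros_like(x), np.zeros_like(x)
    for i in reversed(range(n)):
        y[i] = x[i] + U[i, i + 1:] @ s[i + 1:]
        s[i] = omega * y[i] / d[i]
    w = x + (omega - 2.0) * y
    # Forward substitution: (1 - omega L D^{-1}) v = w, with t = omega D^{-1} v.
    v, t = np.zeros_like(x), np.zeros_like(x)
    for i in range(n):
        v[i] = w[i] + L[i, :i] @ t[:i]
        t[i] = omega * v[i] / d[i]
    return v + y

def direct_product(M, x, omega):
    """Reference: build V1 and V2 explicitly and apply V1^{-1} M V2^{-1}."""
    d, L, U = split(M)
    V1 = (np.diag(d) / omega - L) @ np.diag(omega / d)
    V2 = np.diag(d) / omega - U
    return np.linalg.solve(V1, M @ np.linalg.solve(V2, x))

rng = np.random.default_rng(0)
M = 4.0 * np.eye(6) + rng.normal(size=(6, 6))
x = rng.normal(size=6)
print(np.allclose(eisenstat_product(M, x, 1.4), direct_product(M, x, 1.4)))
```

Note that the substitution version never multiplies by $M$ itself; only the triangular parts $L$ and $U$ are applied once each, which is where the saving comes from.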

Options for $D$ of SWA take each block $D_{ii}$ to be of dimension 12 ($D^{(12)}$), 6 ($D^{(6)}$), 3 ($D^{(3)}$) or 1 ($D^{(1)}$), as suggested by the structure of $D$. The blocks have to be pre-inverted, with costs depending on the block size.
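As a minimal sketch of the pre-inversion step, the Python fragment below cuts the $12 \times 12$ diagonal matrix of one site into sub-blocks of a chosen size and inverts each of them once. The function name `preinvert_blocks` and the random test matrix are assumptions for illustration; in the scheme described above, the entries outside the chosen sub-blocks would have to be assigned to $L$ and $U$.

```python
import numpy as np

def preinvert_blocks(D_site, block):
    """Invert the diagonal sub-blocks of size `block` of one site's 12 x 12 matrix."""
    n = D_site.shape[0]
    if n % block != 0:
        raise ValueError("block size must divide the matrix dimension")
    return [np.linalg.inv(D_site[i:i + block, i:i + block])
            for i in range(0, n, block)]

rng = np.random.default_rng(0)
D_site = 4.0 * np.eye(12) + 0.1 * rng.normal(size=(12, 12))

for block in (12, 6, 3, 1):
    inverses = preinvert_blocks(D_site, block)
    # Applying all sub-blocks to a 12-vector costs (12 / block) * block**2 multiplications.
    print(block, len(inverses), (12 // block) * block**2)
```

Smaller blocks are cheaper to store and to apply, at the price of moving more of the diagonal term into the triangular parts.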

Parallelism can be achieved by locally lexicographic ordering [1]. "Coloring" is the decomposition of all lattice points into mutually disjoint sets $C_1, \dots, C_k$ (with respect to the matrix $M$), if for any $l \in \{1, \dots, k\}$ the property $x \in C_l \Rightarrow y \notin C_l$ for all $y \in n(x)$ holds. $n(x)$ denotes the set of sites $\neq x$ coupled to $x$. A suitable ordering first numbers all $x$ with color $C_1$, then all with $C_2$, etc. Thus, each lattice point couples with lattice points of different colors only. The computation of $v_x$ for all $x$ of a given color $C_l$ can be done in parallel, since terms like $\sum_{y \in n(x),\, y <_o x}$ involve only lattice points of the preceding colors $C_1, \dots, C_{l-1}$, with $x <_o y$ meaning that $x$ has been numbered before $y$ with respect to the ordering $o$.

Let the lattice blocks be of size $n^{loc} = n_1^{loc} \times n_2^{loc} \times n_3^{loc} \times n_4^{loc}$. A different color is associated with each of the $n^{loc}$ sites of a block, which yields $n^{loc}$ groups. A locally lexicographic (ll) ordering is defined to be the color ordering in which all points of a given color are ordered after all points whose colors correspond to lattice positions on the local grid that lexicographically precede the given color. The parallel forward substitution reads:

for all colors $C_i$, $i = 1, \dots, n/p$ $(n/p \in \mathbb{N})$
    for all processors $j = 1, \dots, p$
        $x$ := site with color $C_i$ on processor $j$
        $v_x = w_x + \sum_{y \in n(x),\; y <_{ll} x} L_{xy} s_y$
        $s_x = \omega D_{xx}^{-1} v_x$

For SWA, up to 8 neighbors and for HF all 80 neighbors may be involved on the 4-d grid [1].
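The coloring property behind the locally lexicographic ordering can be verified directly for the hypercube couplings. In the Python sketch below, the color of a site is the lexicographic index of its position inside its local block; the function names (`ll_color`, `hypercube_offsets`, `is_valid_coloring`), the periodic boundary conditions and the small $4^4$ test lattice with $2^4$ local blocks are assumptions chosen for illustration.

```python
import itertools

def ll_color(x, n_loc):
    """Lexicographic index of the position of site x inside its local block."""
    color = 0
    for xi, ni in zip(x, n_loc):
        color = color * ni + (xi % ni)
    return color

def hypercube_offsets(dim=4):
    """All non-zero offsets with components in {-1, 0, +1} (80 of them for dim=4)."""
    return [d for d in itertools.product((-1, 0, 1), repeat=dim) if any(d)]

def is_valid_coloring(shape, n_loc, offsets):
    """True if no site shares its color with any site it couples to."""
    for x in itertools.product(*(range(s) for s in shape)):
        for d in offsets:
            y = tuple((xi + di) % si for xi, di, si in zip(x, d, shape))
            if ll_color(x, n_loc) == ll_color(y, n_loc):
                return False
    return True

shape, n_loc = (4, 4, 4, 4), (2, 2, 2, 2)
colors = {ll_color(x, n_loc) for x in itertools.product(*(range(s) for s in shape))}
print(len(colors), is_valid_coloring(shape, n_loc, hypercube_offsets()))
```

All sites of one color sit at the same position in their respective blocks, so one site per block can be updated simultaneously, and the loop over colors runs through the local grid in lexicographic order.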

4. Improvement 

The SWA has been implemented on an APE100. HF is benchmarked on a DEC Alpha workstation. For SWA, we use a decorrelated set of 10 quenched gauge configurations generated on a $16^4$ lattice at $\beta = 6.0$, at 3 values of $c_{SW}$: 0, 1.0 and 1.769. We have applied BiCGStab as iterative solver. The stopping criterion has been chosen as $\|MX - \phi\| / \|X\| < 10^{-6}$, with $X$ being the solution. We used a local source $\phi$ and determined the optimal OR parameter to be about $\omega = 1.4$ for all block sizes and $c_{SW}$.

We plot the ratio of iteration numbers of the odd-even procedure vs. ll-SSOR as a function of $\kappa$ in Fig. 1. A gain factor of up to 2.5 in iteration numbers can be found. There is no dependence on $c_{SW}$ or on the block size of $D$, and only a 10 % dependence on the local lattice size. As to real CPU costs on APE100, the optimal block size of $D$ is a $3 \times 3$ block, whereas on a scalar system the optimum is found for a $1 \times 1$ diagonal.

Limited by the number of hyper-links to store on the DEC system, we decided to investigate HF on a lattice of size $8^4$. We measured at $\beta = 6.0$ in quenched QCD. We have assessed the critical mass parameter to determine the critical region of HF. For HF, $\omega \simeq 1.0$ is optimal. We find that SSOR preconditioning of HF leads to gain factors $\simeq 3$ close to the critical bare mass $m_c = -0.92$.

Figure 1. Gain of ll-SSOR over o/e.

Figure 2. Dependence of the solution (in CPU time) on the mass parameter $m$.